Power Simulation Exercise: actual data

Vincent Bagilet https://www.sipa.columbia.edu/experience-sipa/sipa-profiles/vincent-bagilet (Columbia University)https://www.columbia.edu/ , Léo Zabrocki https://www.parisschoolofeconomics.eu/en/ (Paris School of Economics)https://www.parisschoolofeconomics.eu/en/
2021-04-16

Purpose of the document

In this document, we carry out a simulation exercise to evaluate the inference properties of different research designs aiming at measuring the short-term effects of air pollution on health. We consider different types of quasi experiments and for each associate an identification strategy.

This document focuses on running the simulations. The results are analyzed and discussed in another document.

Description of the analysis

Data

We use data from 18 cities in France over the 2013-2018 period. The data set contains records of hospital admissions and deaths, mean concentration data for various air pollutants, a bunch of weather variables and calendar control variables (such as school holidays for instance). All variables are at the daily and city level. There is therefore a unique observation per date and per city in the data set.

Background on selected quasi experiments

Note that, for now we focus on quasi experiments for which treatment is binary and homogeneous, both in time and across individuals. It is also not correlated with covariates (apart in the case of the air pollution alerts in which it is correlated with the pollutant level).

Treatment on “random” days

Here, we consider interventions leading to changes in air pollution levels on some random days. Examples of such interventions include transportation strikes. Of course, dates are often not defined as random and are likely to be correlated with unobserved variables. In the present setting, we first consider the golden standard case in which these days are actually defined at random. One can think of this case as a Randomized Control Trial: it represents what would happen if an experimenter could implement a treatment increasing pollution on random days.

The effect of the treatment is estimated using a RCT type of model. The overall idea of the RCT is to compare the average number of deaths or hospital admissions in cities with treatment to cities with no treatment on the same day, controlling for differences across cities.

National intervention

Here consider interventions leading to changes in air pollution levels in all cities after a given date. Examples of such interventions include an air pollution reduction policy at a national level or a change in regulations at a national level leading to an increase in pollution level.

The effect of the treatment is estimated using an ITS type of model. The overall idea of the ITS is to compare the average number of deaths or hospital admissions before and after treatment.

Sub-national intervention

Here, we consider interventions leading to changes in air pollution levels in a subset of cities after a given date. Examples of such interventions include an air pollution reduction policy at a sub-national level or a change in regulations at a sub-national level leading to an increase in pollution level.

The effect of the treatment is estimated using a DID type of model. The overall idea of the DID is to compare the average number of deaths or hospital admissions in cities with treatment to cities with no treatment on the same day, controlling for differences across cities.

Rolling intervention

Here, we consider interventions leading to changes in air pollution levels in a all cities but the change happens after a different date for each city. Examples of such interventions include an air pollution reduction policy at a national level whose implementation is rolled out.

The effect of the treatment is also estimated using a DID type of model.

Air pollution alert

Here consider interventions that affect exposure to air pollution when air pollution levels reach a given threshold. Examples of such interventions include air pollution alerts: when pollution reaches a certain level, alerts are released, inviting people to reduce their exposure.

The effect of the treatment is estimated using a RD type of model. The overall idea of the RD is to compare days just below the threshold to days just above the threshold (where exposure and health impacts are thus lower). The assumption is that days just below and just above the threshold are comparable.

Analysis

Overall setting

To sum up the analysis, the aim is to measure the inference properties of different research designs aiming at measuring the short-term effects of air pollution on health. For simplicity, consider the daily number of death as the output variable of interest for now. We proceed as follows:

  1. We define the length of the study period, ie the number of observations, and a true effect size, \(\beta_0\), representing the percentage change in the number of deaths in response to the treatment.
  2. We draw a study period randomly
  3. We define the treated days, ie we define a treatment variable \(T_{ct}\) equal to 1 if the city \(c\) is treated at time \(t\) and zero otherwise.
  4. We compute the number of deaths when treated for each observation (\(Y(1)\)) based on the “true” effect of the treatment. We then derive \(Y_{obs} = Y(0) × (1-T) + Y(1) × T\).
  5. We estimate our model with \(Y_{obs}\) as a dependent variable and retrieve \(\hat{\beta}\) and the associated p-value.
  6. We run the steps 2 to 5 \(n_{iter}\) times.
  7. We compute the average bias, type M, type S and power.

Varying “parameters”

Note that there are several parameters we can vary in order to evaluate the performance of the different methods: the identification strategy, the number of observations, the proportion of treated days, the true effect size and the model. In order to limit the number of simulations and for clarity, we only modify one parameter at the time, keeping others constant and equal to a baseline value. When we vary a parameter, we consider the following values:

We consider this set of parameters for each quasi-experiment.

Deviation from the ideal case

In addition to considering an “ideal” case in which treatment is allocated randomly, the effect of the treatment is homogeneous and with perfect compliance, we can consider deviations from this “ideal” case:

Allocation Compliance Effect of the treatment
Random Yes Homogeneous
Random No Homogeneous
Correlated with covariates Yes Homogeneous
Random Yes Heterogeneous

Deviating from the ideal case therefore yields a very large number of cases. We limit ourselves to a sample of these cases.

Actual implementation

In this section, we follow the steps described in the section “analysis” and carry out our analysis. We basically define a function for each step.

Loading and Formatting Data

We load the packages and the data and wrangle it into a format well suited for this analysis.

Function definitions

Drawing the study period

First, we create a function to randomly draw a study period of a given length. For simplicity, we choose to have the same study period for each city. This also seems realistic; a study focusing on several cities would probably consider a unique study period.

This function randomly selects a starting date for the study, early enough so that the study can actually last the number of days chosen, and returns a boolean vector indicating whether each date is in the study or not.

draw_study_period <- function(dates, n_days_study = 200) {
  begin_study <- sample(seq.Date(min(dates), max(dates) - n_days_study, "day"), 1)
  
  dplyr::between(dates, begin_study, begin_study + n_days_study)
}

Defining the treatment

Then, we create a function to draw the treatment. This function returns a boolean vector, stating whether each observation is in the treatment group or not.

For each treatment, we define the proportion of treated units (p_treat). The different treatments correspond to different types of quasi-experiments. Note that we will consider some identifications strategies that will only consider restricted samples (RDD and RDIT). For coding simplicity, to get an accurate proportion of treated units, we directly restrict the sample in this function. For the RDIT, we create another type of quasi experiment (“national_short”), even though the quasi experiment is actually the same as for a national intervention.

In this function, we made some simulations decisions:

draw_treated <- function(data, p_treat = 0.5, quasi_exp = "random_day") {
  
  if (quasi_exp == "random_days") {
    treated <- rbernoulli(length(data[["date"]]), p_treat)
  } else if (quasi_exp == "national") {
    num_date <- as.numeric(data[["date"]])
    treated <- (num_date >= quantile(num_date, 1 - p_treat))
  } else if (quasi_exp %in% c("GAM", "IV")) {
    # treated <- TRUE
    treated <- rbernoulli(length(data[["date"]]), p_treat)
    treated <- ifelse(treated, treated, NA) 
  } else if (quasi_exp == "local") {
    treated_cities <- unique(data[["city"]]) %>%
      sample(size = round(length(.) * min(p_treat * 2, 1)))
    treated <- (data[["city"]] %in% treated_cities &
                  data[["date"]] >= median(data[["date"]]))
  } else if (str_starts(quasi_exp, "alert")) {
    pollutant <- str_extract(quasi_exp, "(?<=_).+")
    threshold_pos <- rbeta(1, 20, 2) 
    treated <- data %>%
      group_by(.data$city) %>%
      mutate(
        threshold = quantile(.data[[pollutant]], threshold_pos, names = FALSE),
        treated = (.data[[pollutant]] >= threshold),
        bw = cut_width(
          .data[[pollutant]],
          width = length(.) * 2 * p_treat,
          center = unique(threshold) #threshold center of an interval
        ),
        treated = ifelse(
          threshold > as.numeric(str_extract(bw, "([:digit:]|\\.|-)+(?=,)")) &
            threshold < as.numeric(str_extract(bw, "(?<=,)([:digit:]|\\.)+")), treated, NA)
      ) %>%
      ungroup() %>%
      .$treated
  } else if (quasi_exp == "rolling") {
    treated <- data %>%
      group_by(.data$city) %>%
      mutate(
        treated = (.data$date > sample(
          seq.Date(
            min(.data$date) + (1 - 2*p_treat)*length(.data$date),
            max(.data$date), "day"
          ), 1))
      ) %>%
      ungroup() %>%
      .$treated
  } else if (quasi_exp == "national_short") {
    dates <- unique(data[["date"]])
    bw <- floor(min(p_treat, 0.5) * length(dates))
  
    date_start_treat <- sample(seq.Date(min(dates) + bw, max(dates) - bw, "day"), 1)
    
    treated <- (data[["date"]] >= date_start_treat)
    treated <- ifelse(
      data[["date"]] > date_start_treat + bw | data[["date"]] < date_start_treat - bw,
      NA, treated
    )
  }
  
  return(treated)
  # data <- data %>% mutate(treated = treated)
  # return(data)
}

Both to verify that everything works well and for illustration, we can make quick plots:

Creating the output if a unit is treated

We then create the output if a unit is treated, Y(1).

create_y1 <- function(dep_var,
                      percent_effect_size = 0.5,
                      quasi_exp = "random_days",
                      pollutant = NULL) {
  
  # y1 <- output_var*(1 + percent_effect_size/100)
  if (str_starts(quasi_exp, ("random|national|local|alert|rolling"))) {
    y1 <- dep_var + rpois(length(dep_var), dep_var * percent_effect_size / 100) %>%
      suppressWarnings() #warnings when is..na(dep_var) eg rpois(1, NA)
  } else if (quasi_exp == "GAM") {
    y1 <- dep_var + rpois(length(dep_var), pollutant * percent_effect_size / 100) %>%
      suppressWarnings()
  }
  return(y1)
} 

Estimate the model

We can then estimate our model and retrieve the point estimate and p-value. The model should be specified in a three part formula as follows: y ~ x | fixed effects | cluster. If one does not want to set fixed effects, a 0 should be put in the second part of the formula: y ~ x | 0 | cluster.

estimate_model <- function(data, formula) {
  #get the different parameters from the formula
  fml <- Formula::as.Formula(formula)
  cluster <- formula(fml, lhs = 0, rhs = 3) %>% 
    suppressWarnings() #when no cluster provided, warning
  actual_fml <- formula(fml, rhs = -3)
  se <- ifelse(cluster == ~0, "hetero", "cluster")  
  
  #run the estimation
  est_results <- data %>% 
    feols(
      data = ., 
      fml = actual_fml, 
      cluster = cluster,
      se = se
    ) 
  
  #retrieve the usefull info
  nobs <- length(est_results$residuals)
  
  est_results %>%
    broom::tidy(conf.int = TRUE) %>%
    filter(term =="treatedTRUE") %>%
    rename(p_value = p.value) %>%
    mutate(estimate = estimate) %>%
    select(estimate, p_value) %>%
    mutate(n_obs = nobs)
} 

Computing simulations

We then create a function running all the previous functions together and therefore performing an iteration of the simulation for a given set of parameters. This function returns a one row data set with estimate, p-value, number of observations and true effect size.

compute_simulation <- function(data,
                               n_days_study = 200,
                               p_treat = 0.5,
                               quasi_exp = "random_days",
                               percent_effect_size = 0.5,
                               formula = "deaths_all_causes ~ treated") {
  
  fml <- Formula::as.Formula(formula)
  dep_var <- paste(fml[[2]])
  
  sim_data <- data %>%
    mutate(study_period = draw_study_period(date, n_days_study)) %>%
    filter(study_period) %>%
    select(-study_period) %>%
    mutate(
      treated = draw_treated(., p_treat, quasi_exp),
      # true_treated = draw_treated(., p_treat, quasi_exp),
      y0 = .data[[dep_var]],
      y1 = create_y1(.data[[dep_var]], percent_effect_size),
      yobs = y1 * treated + y0 * (1 - treated)
      # yobs = y1*true_treated + y0*(1 - true_treated)
    ) %>%
    filter(!is.na(treated)) #not necessary bc dropped in lm()
  # filter(!is.na(true_treated)) #not necessary bc dropped in lm()
  
  #for the local intervention, we estimate a DID 
  #and thus need a post and a city_treated variable
  if (quasi_exp == "local") {
    sim_data <- sim_data %>%
      group_by(city) %>%
      mutate(city_treated = as.logical(max(treated))) %>%
      ungroup() %>%
      group_by(date) %>%
      mutate(post = as.logical(max(treated))) %>%
      ungroup()
  }
  
  #for the nation intervention, we estimate an ITS 
  #and thus need a time index and time index
  if (str_starts(quasi_exp, "national")) {
    sim_data <- sim_data %>%
      group_by(city) %>%
      mutate(
        t = as.numeric(date) - min(as.numeric(date)),
        t_post = max(0, as.numeric(date) - as.numeric(min(.data$date[treated == TRUE])))
      ) %>%
      ungroup()
  }
  
  sim_output <- sim_data %>%
    estimate_model(formula = update(fml, yobs ~ .)) %>%
    mutate(true_effect = mean(sim_data$y1 - sim_data$y0, na.rm = TRUE))
  
  return(sim_output)
}

test <- total_data %>%
  compute_simulation(
    formula = "deaths_na_causes ~ treated",
    n_days_study = 700,
    quasi_exp = "national_short"
  ) 

We will then loop this function to get a large number of replications of each simulation for a given set of parameters. We will also vary the values of the different parameter.

Running the simulations

We can then run the simulations. But first, we define the set of parameters we want to consider.

Defining the parameters

We create a table displaying in each row a set of parameters we want to have a simulation for, sim_param. We will then map our function compute_simulation on this table.

To build sim_param, we first define a set of baseline values for our parameters and store them in a data frame, sim_param_base. We will then vary the values of the parameters one after the other. We thus create vectors containing the different values of the parameters we want to test.

sim_param_base <- tibble(
  n_days_study = 500,
  p_treat = 0.2,
  percent_effect_size = 0.5, 
  formula = "deaths_na_causes ~ treated + temperature + I(temperature^2) | city + month + year + day_of_week"
)

write_csv(sim_param_base, "../Outputs/sim_param_base.csv")

vect_n_days_study <- c(100, 250, 500, 750, 1000)
vect_p_treat <- c(0.01, 0.2)
vect_percent_effect_size <- c(0.1, 0.5, 1)
vect_formula <- c(
  "deaths_na_causes ~ treated", 
  "deaths_na_causes ~ treated | city",
  "deaths_na_causes ~ treated | city + month + year + day_of_week",
  "deaths_na_causes ~ treated + temperature + I(temperature^2) | city + month + year + day_of_week"
)

We then want to create the actual table, varying parameters one after the other. To do so, we create a simple function add_values_param. This function adds the values of a parameter contained in a vector. We can then map this function on all the vectors of parameters of interest.

#adds all values in vect_param
add_values_param <- function(df, vect_param) {
  param_name <- str_remove(vect_param, "vect_")
  
  tib_param <- tibble(get(vect_param))
  names(tib_param) <- param_name
  
  df %>% 
    full_join(tib_param, by = param_name) %>% 
    fill(everything())
}

vect_of_vect_param <- c("vect_n_days_study", "vect_p_treat", "vect_percent_effect_size", "vect_formula")

sim_param_unique <- 
  map_dfr(
    vect_of_vect_param, 
    add_values_param, 
    df = sim_param_base
  ) %>% 
  distinct() 

We want to compute our simulations for this set of parameters for every quasi experiment. Note that, in order to identify the effect of interest, we consider different types of models, depending on the identification strategy considered for each quasi experiment.

Then, for each set of parameters we want to run many iterations of the simulation, n_iter. We thus modify sim_param so that each set of parameter appears n_iter times. It will enable us to loop compute_simulation directly on sim_param.

vect_quasi_exp <- c("random_days", "national", "local", "national_short", "rolling") #, "alert_pm10"
n_iter <- 2

sim_param <- sim_param_unique %>%
  crossing(vect_quasi_exp) %>%
  rename(quasi_exp = vect_quasi_exp) %>%
  mutate(
    formula = case_when(
      quasi_exp == "local" ~ str_replace_all(formula, "treated", 
                                             "treated + post + city_treated"),
      str_starts(quasi_exp, "national") ~ str_replace_all(
        formula, "treated", 
        "treated + t + t_post"),
      str_starts(quasi_exp, "alert") ~ str_replace_all(formula, "treated", paste(
        "treated + ", str_remove(quasi_exp, "alert_")
      )),
      TRUE ~ formula
    )
  ) %>%
  crossing(rep_id = 1:n_iter) %>%
  select(-rep_id) %>% 
  mutate(sim_id = row_number())

# write_csv(sim_param, "../Outputs/sim_param.csv")

Running the simulation

We can then run the simulations for each set of parameter, using a pmap_dfr function.

tic()
simulations <- future_pmap_dfr(
  sim_param %>% select(-sim_id), 
  compute_simulation, 
  data = total_data,
  .id = "sim_id",
  .options = furrr_options(seed = TRUE)
) 
toc()

all_simulations <- simulations %>% 
  as_tibble() %>% 
  mutate(sim_id = as.numeric(sim_id)) %>% 
  left_join(sim_param, by = "sim_id") %>% 
  select(-sim_id)

# saveRDS(all_simulations, "../Outputs/data_simulations.RDS")

Summarising the results

We then summarize our results, computing power, type M and so on for each set of parameters using he function summarise_simulations. Note that this function can only take as input a data frame produced by compute_simulation (or a mapped version of this function).

summarise_simulations <- function(data) {
  
  data %>% 
    group_by(formula, quasi_exp, n_days_study, p_treat, percent_effect_size) %>%
    summarise(
      power = mean(p_value <= 0.05, na.rm = TRUE)*100, 
      bias = mean(abs((estimate - true_effect)/true_effect), na.rm = TRUE),
      average_p_value = mean(p_value, na.rm = TRUE),
      average_n_obs = mean(n_obs, na.rm = TRUE),
      type_m = mean(ifelse(p_value < 0.05, abs(estimate)/true_effect, NA), na.rm = TRUE),
      type_s = sum(ifelse(p_value < 0.05, sign(estimate) != sign(true_effect), NA), na.rm = TRUE)/n()*100,
      .groups  = "drop"
    ) %>% 
    ungroup()
} 

We then run this function.

summary_simulations <- summarise_simulations(all_simulations)

# saveRDS(summary_simulations, "../Outputs/summary_simulations.RDS")